Standardized Phylogenetic Classification of Human Respiratory Syncytial Virus below the Subgroup Level

A globally implemented unified phylogenetic classification for human respiratory syncytial virus (HRSV) below the subgroup level remains elusive. We formulated global consensus of HRSV classification on the basis of the challenges and limitations of our previous proposals and the future of genomic surveillance. From a high-quality curated dataset of 1,480 HRSV-A and 1,385 HRSV-B genomes submitted to GenBank and GISAID (https://www.gisaid.org) public sequence databases through March 2023, we categorized HRSV-A/B sequences into lineages based on phylogenetic clades and amino acid markers. We defined 24 lineages within HRSV-A and 16 within HRSV-B and provided guidelines for defining prospective lineages. Our classification demonstrated robustness in its applicability to both complete and partial genomes. We envision that this unified HRSV classification proposal will strengthen HRSV molecular epidemiology on a global scale.

as a tool to monitor their efficacy. Standards for HRSV nomenclature for sharing of viral isolates and sequences in databases have been published (4). Nevertheless, a standardized HRSV phylogenetic classification system has yet to be defined and implemented.
In 2022, HRSV was designated as Orthopneumovirus hominis species within the Pneumoviridae family. Below species level are 2 antigenic groups, known as HRSV subgroup A (HRSV-A) and B (HRSV-B), that were previously referred to as subtypes (4-6). Within each subgroup, genotypes were initially defined based on statistically supported phylogenetic clades inferred with the second hypervariable region of the G gene (Figure 1, panels A, B) (7). The G gene, encoding the attachment glycoprotein, exhibits the highest genetic and antigenic variability. Of note, the gene has undergone a duplication of a 72-nt fragment in HRSV-A and a 60-nt fragment in HRSV-B (Figure 1, panel B) (8,9).
To identify emerging genotypes, researchers have used genetic distances between phylogenetic clades and distinctive genetic features, accompanied by variable nomenclature based on the gene (GA1-GA7 in HRSV-A and GB1-GB4 in HRSV-B), country and subgroup (SAB1-SAB4 for South African genotypes in HRSV-B), or city and province (NA1-NA2 [Niigata] and ON1 [Ontario] in HRSV-A, BA1-BA9 [Buenos Aires] in HRSV-B) (7-16). Since 2020, alternative phylogenetic reclassifications have been proposed; Goya et al. established a hierarchical classification system for HRSV phylogenies, comprising genotypes, subgenotypes, and lineages, using the G gene (17). That framework enabled laboratories without capacity for whole-genome sequencing to conduct molecular epidemiology studies. Independently, Ramaekers et al. (18) proposed reclassifications into lineages and Chen et al. (19) into genotypes using complete HRSV genomes. Those approaches support comprehensive monitoring of viral evolution across all genes, including the F gene encoding the fusion protein, a crucial target for monoclonal antibodies and the foundation of approved HRSV vaccines (Figure 1, panel A). Of note, challenges in HRSV molecular epidemiology persisted within the reclassification-defined categories because of reliance on genetic or patristic distances between tree tips or nodes.
(20,21). We trimmed alignment ends to encompass complete genomes from the first codon of the first gene (NS1) to the last codon of the last gene (L). We considered partial genomes if the lack of sequence was within 50 nt of the genome ends. We used RSVsurver (https://rsvsurver.bii.a-star.edu.sg) to identify and remove genomes with nucleotide insertions or deletions causing a frameshift in any open reading frame. After alignment trimming, detection of identical sequences prompted redundancy removal using BBmap (https://jgi.doe.gov/data-and-tools/software-tools/bbtools), resulting in the final set of 1,538 HRSV-A and 1,387 HRSV-B genomes (Appendix 1 Figure 1).

Baseline Agreements on the HRSV Classification Definition
Our proposed classification establishes HRSV lineages for viruses below subgroup level. Studies have shown that HRSV phylogenetic trees constructed with complete genomes exhibit superior resolution (17-19). Therefore, we defined a classification system based on maximum-likelihood phylogenetic trees inferred from complete HRSV genomes. The maximum-likelihood algorithm formulates hypotheses about the evolutionary relationships among sequences; the IQ-TREE implementation handles large datasets, which makes it particularly well suited to assess HRSV genomic phylogeny, including sequences collected >50 years ago (22). We defined complete HRSV genomes as the nucleotide sequences spanning from the first codon of the first gene (NS1) to the last codon of the last gene (L). We considered genomes almost complete if the sequence information gaps were within a 50-nt window at the genome ends. To define lineages, we used only genomes without nucleotide ambiguities (in accordance with the IUPAC code for nucleotide degeneracy).
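The eligibility criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline; it assumes each genome is already aligned and trimmed to the NS1-to-L span, that missing terminal data are written as `-` or `N`, and that internal `-` characters are alignment gaps rather than ambiguities. The function names are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")


def terminal_gap_lengths(aligned: str, missing: str = "-N") -> tuple[int, int]:
    """Length of the missing-data run at the 5' and 3' ends of an aligned genome."""
    seq = aligned.upper()
    left = len(seq) - len(seq.lstrip(missing))
    right = len(seq) - len(seq.rstrip(missing))
    return left, right


def usable_for_lineage_definition(aligned: str, max_end_gap: int = 50) -> bool:
    """True if terminal gaps fit the 50-nt window and no ambiguous base remains."""
    seq = aligned.upper()
    left, right = terminal_gap_lengths(seq)
    if left > max_end_gap or right > max_end_gap:
        return False
    core = seq[left:len(seq) - right]
    if not core:
        return False  # nothing but missing data
    # Internal '-' is treated as an alignment gap (assumption), not an ambiguity.
    return all(base in UNAMBIGUOUS or base == "-" for base in core)
```

Genomes that fail this filter could still be assigned to an existing lineage; the filter applies only to the genomes used to define lineages.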

Accurate Root Placement in HRSV Phylogenetic Trees
We reconstructed maximum-likelihood phylogenetic trees for the HRSV-A and HRSV-B datasets. We used 2 approaches to root the trees: the use of an outgroup, a conventional method for inferring the tree root using sequences known to be evolutionarily distant; and phylodynamic analysis, integrating temporal and phylogenetic patterns in virus evolution (Appendix 1). Both approaches consistently identified the same root for each subgroup cluster (Appendix 1 Figure 3). Phylodynamic analysis also identified 58 outlier sequences for HRSV-A and 2 for HRSV-B that were excluded from lineage designation. The final dataset considered for lineage designation comprised 1,480 HRSV-A and 1,385 HRSV-B genomes (Appendix 2 Table, https://wwwnc.cdc.gov/EID/article/30/8/24-0209-App2.xlsx).

HRSV Lineage Definition
We defined an HRSV lineage as a statistically supported monophyletic cluster comprising >10 sequences and characterized by >5 aa substitutions compared with the parental lineage. The lineage-defining amino acids, present in >90% of the sequences within the clade, may be found in any of the viral proteins.
Phylogenetic classifications vary among viral species; each aims to define clusters reflecting the heterogeneity of the viral population, considering each virus's unique evolutionary characteristics and using arbitrary thresholds for long-term applicability (27-29). Inherent bias exists in any classification system because of the availability and spatiotemporal representation of sequences. Therefore, our HRSV lineage definition did not include criteria requiring sequences from different outbreaks or countries, to enable early detection of novel lineages. However, we propose establishing a threshold of >10 genomes for defining a lineage to monitor HRSV strains circulating within communities.
We observed that the presence of distinctive signature amino acids shared by sequences of a phylogenetic clade, in comparison to the parental lineage, is a simple method to identify a new lineage. Distance-based methods (i.e., average nucleotide genetic distances, average patristic distances, or patristic distances between nodes) need phylogenies with complete datasets to define new categories and become complex with rapid increases in available sequences (16-19). In our proposal, we initially screened different amino acid thresholds in an automated manner, ranging from 1 to 10 lineage-defining amino acids (Appendix 1). The number of small lineages decreased as the number of lineage-defining amino acids increased, and 5 amino acids resulted in an intermediate complexity of lineages defined for both HRSV subgroups. Furthermore, we proposed that the lineage-defining amino acids should be conserved in >90% of the genomes within a clade, considering the potential reversion in some of the genomes within highly mutated hotspot sites. We acknowledged that other thresholds for the number of genomes or amino acids could be useful, but we emphasized that the key to establishing a global consensus is clear operational guidelines and a robust classification, 2 aspects that our proposal fulfills.
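The definition above lends itself to a short check. The following is a minimal sketch under stated assumptions, not the authors' implementation: `parent` and every entry of `clade` are equal-length amino acid strings (for example, concatenated aligned viral proteins), the clade is already known to be a statistically supported monophyletic cluster, and the thresholds are treated as inclusive minimums, which is an assumption about how the printed thresholds should be read. Function names are hypothetical.

```python
from collections import Counter


def lineage_defining_substitutions(parent: str, clade: list[str],
                                   min_fraction: float = 0.9):
    """Return (position, parent_aa, clade_aa) for sites where at least
    min_fraction of clade sequences share a residue that differs from the parent."""
    if not clade:
        return []
    substitutions = []
    for i, parent_aa in enumerate(parent):
        counts = Counter(seq[i] for seq in clade)
        aa, n = counts.most_common(1)[0]
        if aa != parent_aa and n / len(clade) >= min_fraction:
            substitutions.append((i + 1, parent_aa, aa))  # 1-based position
    return substitutions


def qualifies_as_lineage(parent: str, clade: list[str],
                         min_sequences: int = 10,
                         min_substitutions: int = 5,
                         min_fraction: float = 0.9) -> bool:
    """Apply the sequence-count and amino acid thresholds to a candidate clade."""
    if len(clade) < min_sequences:
        return False
    found = lineage_defining_substitutions(parent, clade, min_fraction)
    return len(found) >= min_substitutions
```

Raising `min_substitutions` in such a check yields fewer, larger lineages, which mirrors the threshold screening described in the text.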

HRSV Lineage Nomenclature
We defined the lineage nomenclature integrating the HRSV subgroup letter and ascending ordinal numbers, separated by dots to represent nested lineages (Figure 3, panels A, B; Figure 4, panels A, B). Furthermore, we assigned a distinct nomenclature to the 72-nt (24
To remain functional, a nomenclature system requires periodic updates as new lineages emerge. Therefore, we have established 2 open repositories on GitHub containing definitions of each lineage, signature mutations, and representative sequences. The repositories are available at https://github.com/rsv-lineages/lineage-designation-A and https://github.com/rsv-lineages/lineage-designation-B; they are intended to provide up-to-date definitions and serve as a platform for discussion and designation of novel lineages.

Lineages within the HRSV-A and HRSV-B Rooted Trees
We reconstructed ancestral sequences at the root of the phylogenetic trees. Although the sequences are not biologically real, they served as surrogate parental lineages during initial classification. Identifying monophyletic clusters with >10 sequences and >5 aa changes compared with the reconstructed root sequence, we defined 3 HRSV-A lineages (A.1-A.3) and 4 HRSV-B lineages (B.1-B.4). We were unable to classify 2 sequences, EPI-ISL-15771600_USA_1956 (GISAID) and MG642074_USA_1980 (GenBank), perhaps because they belong to underrepresented extinct lineages.
We further analyzed the first lineages in an iterative manner to identify nested lineages; as a result, we identified a total of 24 lineages within HRSV-A and 16 within HRSV-B (Figures 3, 4). Close to the root of the HRSV-B tree, extinct lineages were underrepresented, comprising <10 sequences but featuring >5 distinct amino acids (B.1, B.3, B.4). Despite the low number of sequences, we included them as lineages to trace evolutionary branches that gave rise to currently circulating lineages. In addition, A.D.2 is slightly below the sequence threshold; nonetheless, we kept the lineage category to emphasize the common ancestor of A.D.2.1 and A.D.2.2.
We scrutinized the presence and absence of the duplication in the G gene across each tree. Although patterns were mostly as expected with a single historical duplication event, some genomes within the clade with the duplication in G lacked the duplication. The dispersed association of these sequences in the phylogenetic tree, rather than the monophyletic cluster we expected, suggests the virus did not lose the nucleotide duplication (Appendix 1 Figure 4). Instead, for certain short-read next-generation sequencing technologies, a read length similar to that of the duplication region potentially masked the presence of the duplication when reads were used in consensus genome assembly with reference sequences that do not possess the nucleotide duplication. Therefore, we recommend using such data with quality-filtered reads of a length >150 nt to avoid this problem.
Lineage-defining amino acids were present in all HRSV proteins, primarily identified within the G protein (Tables 1, 2). Also, the lineage-defining amino acids in the polymerase L protein were noteworthy, contributing to the distinction of 21 of 24 HRSV-A lineages and 15 of 16 HRSV-B lineages (Tables 1, 2). Of interest, the F protein contributed to defining 14 lineages in HRSV-A and 13 in HRSV-B (Figure 3, panel B; Figure 4, panel B). The G and F surface glycoproteins are likely under selection pressure from antibody-mediated immunity and exhibit a robust phylogenetic signal (18,31). Whereas the G protein displays substantial nucleotide and amino acid sequence plasticity, the F protein experiences strong negative selection, likely attributed to functional or structural constraints (34). For instance, the fusion peptide is the only region in F without lineage-defining amino acids (Figure 3, panel B; Figure 4, panel B). Although the low diversity of the F protein is promising for HRSV interventions, monitoring the F protein during global implementation is essential to estimate the antigenic impact of amino acid substitutions.

Using G and F Sequences with the HRSV Lineage Classification System
The main challenge for global expansion of HRSV genomics is the absence of a cost-effective, globally standardized and validated methodology for sequencing, in contrast to SARS-CoV-2 or influenza virus (35,36). In addition, limited funding and infrastructure cause some laboratories to prefer sequencing the G gene only (37-39). Although we highly recommend using complete genomes for HRSV lineage assignment to ensure the maximum accuracy of the classification and monitor the amino acid changes in all viral proteins, partial genomes covering the G and F genes can be used because overall they reproduce the topology of the HRSV tree (17,18). We do not recommend the use of smaller G gene regions such as the second hypervariable region (250-nt length at the 3′ gene end) (Figure 1) that was used historically for molecular epidemiology because previous reports showed a decreased phylogenetic signal (17). The use of G, F, or both genes for lineage classification should rely on phylogenetic associations with reference sequences. Of note, using only G and F genes is inadequate for defining novel lineages because of the inability to detect lineage-defining amino acids across all viral proteins. Our analysis showed minimal misclassification (1.2%) in HRSV-A and none in HRSV-B when using only the G gene (Appendix 1 Figure 5). However, the G ectodomain alone resulted in an 18.86% misclassification rate for HRSV-A and none for HRSV-B. The F gene alone had misclassification rates of 38.18% for HRSV-A and 1.23% for HRSV-B because of polytomies affecting lineage assignments within A.D.1 and A.D.5. Combining G and F gene fragments reduced misclassification to 0.07% for HRSV-A and none for HRSV-B, indicating that this approach provides optimal resolution for both subgroups (Appendix 1 Figure 5).
Importantly, assigning the lineage of a query sequence does not require the use of complete genomes or the absence of nucleotide ambiguities; rather, it requires a supported association within a phylogenetic clade. However, defining a new lineage requires the use of complete genomes without ambiguities, because amino acid characterization of all viral proteins is essential.
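A misclassification rate of the kind reported above can be computed by comparing, sequence by sequence, the lineage assigned from the complete genome with the lineage assigned from a partial region. The sketch below is illustrative only; the source does not describe the exact procedure, and the function name, the dictionary layout (sequence identifier mapped to lineage name), and the example identifiers are assumptions.

```python
def misclassification_rate(reference: dict[str, str],
                           partial: dict[str, str]) -> float:
    """Percentage of shared sequences whose partial-region lineage differs
    from the complete-genome (reference) lineage."""
    shared = reference.keys() & partial.keys()
    if not shared:
        raise ValueError("no sequences in common between the two assignments")
    mismatches = sum(reference[seq_id] != partial[seq_id] for seq_id in shared)
    return 100 * mismatches / len(shared)


# Hypothetical example: one of four sequences is placed in the wrong lineage.
full_genome = {"s1": "A.D.1", "s2": "A.D.5", "s3": "A.D.1", "s4": "A.3"}
f_gene_only = {"s1": "A.D.1", "s2": "A.D.1", "s3": "A.D.1", "s4": "A.3"}
rate = misclassification_rate(full_genome, f_gene_only)
```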

Molecular Epidemiology of HRSV with Proposed Classification
We described the HRSV molecular epidemiology including all available genomes, even those previously discarded during the dataset curation. We analyzed the seasonality of lineages using a dataset comprising 2,277 HRSV-A and 2,058 HRSV-B genomes, revealing notable co-circulation and lineage replacement over time (Figure 5). In HRSV-A, A.

Discussion
Consensus classification of HRSV below the subgroup level has been a challenge for multiple decades. Collaboratively, the HRSV molecular evolution research community, along with experts in the evolution of other respiratory viruses, has worked toward establishing a unified global classification system through the HRSV Genotyping Consensus Consortium (RGCC) initiative. Our proposal categorizes HRSV-A/B sequences into lineages based on phylogenetic associations and amino acid markers, relying on complete genomes. Partial or low-quality genomes can be assigned to the existing lineages, emphasizing the robustness of this system. We developed standard guidelines for lineage definition and assignment and created online resources for updates, ensuring long-term utility. Defining a viral category below species through a phylogenetic-based classification is challenging; the system must exhibit reproducibility, balance complexity, and be updatable to capture the level of heterogeneity useful for viral surveillance. Our proposal addresses those requirements comprehensively. HRSV is not an emerging virus; it generates annual outbreaks with co-circulation and replacement in the prevalence of its antigenic subgroups. Although some HRSV genomes were collected from clinical samples >50 years ago, the largest increase in the number of genomes has occurred since 2021. A limitation of our definition is the uncertainty of the antigenic effect of individual amino acid substitutions on lineages. Hence, whole-genome surveillance together with the study of lineage-phenotype association is essential, as observed in genetic and antigenic characterization in influenza to estimate the effectiveness of immunization (47). In 2023, recombinant F protein vaccines were approved; as their implementation progresses, we will learn how the vaccines affect viral evolution. We expect our unification proposal for the phylogenetic classification of HRSV to support spatiotemporal comparative lineage surveillance and detection of emerging lineages. In addition, we anticipate studies of association between lineages and the severity of HRSV disease, as well as associations of particular lineages with patients' demographic characteristics.
Applying the established baseline agreements, we gathered 1,538 HRSV-A and 1,387 HRSV-B high-quality genomes from public databases. The dataset revealed limited global HRSV genomic surveillance; <20 genomes were deposited annually through 2007 (Figure 2, panel A; Appendix 1 Figure 2). Since 2008, the number of genomes and representation of countries improved; a surge occurred after 2021, probably driven by expansion of viral genomics since the SARS-CoV-2 pandemic and the approval of the HRSV prophylactic treatments (Figure 2, panel A; Appendix 1 Figure 2). Considering delays in genome deposition in public databases, the number of genomes in 2022 may be higher than the number used in this study. Regarding geographic representation, 9 countries (Australia, United Kingdom, New Zealand, United States, Argentina, Kenya, Morocco, Netherlands, and Brazil) submitted >100 genomes; only the United Kingdom achieved uninterrupted surveillance since 2008, but Australia deposited the most genomes globally (Figure 2, panel B).

Figure 2. The global HRSV genomics surveillance landscape. HRSV genomes from GenBank and GISAID (https://www.gisaid.org) databases through March 11, 2023, that met inclusion criteria used for classification are shown by year of sample collection and subgroup (A) and by country of origin (B). HRSV, human respiratory syncytial virus.
-aa) G-gene duplication within HRSV-A and 60-nt (20-aa) G-gene duplication within HRSV-B. Those genetic events are epidemiologically relevant, because only viruses with G-gene duplication have been detected since 2017 (30-33). To track those viruses, we used the alias D, specifically A.D (historically, ON1 genotype) for HRSV-A and B.D (historically, BA genotype) for HRSV-B, and nested lineages with increasing ordinal numbers. In summary, letters A and B indicate the HRSV subgroup at the beginning of the lineage name, C is unused, and D serves as an alias for the 72-nt and 60-nt duplications within the G gene. In addition, aliases starting from E are limited to 3 numerical levels of nested lineages, preventing indefinite accumulation of numbers. For example, the B.D.4.1.1 lineage has descendant lineages named B.D.E.1-B.D.E.4 instead of B.D.4.1.1.1-B.D.4.1.1.4, where E represents 4.1.1 (Figure 4, panels A, B). The nomenclature is based on the tree topology, reflecting the order of the nodes from the root to the tips, but it is unrelated to the sequence collection date or date of the most recent common ancestor of the lineage.
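The alias rule can be illustrated with a small helper that expands an aliased name to its full numeric form and tests ancestry. This is a sketch under assumptions: the alias table holds only the one mapping stated in the text (E for 4.1.1 under B.D), the function names are hypothetical, and aliases that themselves contain aliases are not handled. The maintained definitions live in the lineage-designation repositories, not in this example.

```python
# Only the mapping given in the text; a real table would come from the repositories.
ALIASES = {"B.D.E": "B.D.4.1.1"}


def expand_lineage(name: str, aliases: dict[str, str] = ALIASES) -> str:
    """Replace the longest aliased prefix of a lineage name with its full form."""
    parts = name.split(".")
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in aliases:
            return ".".join([aliases[prefix], *parts[end:]])
    return name  # no alias used


def is_descendant(child: str, ancestor: str,
                  aliases: dict[str, str] = ALIASES) -> bool:
    """True if `child` is nested strictly below `ancestor` in the naming hierarchy."""
    child_parts = expand_lineage(child, aliases).split(".")
    ancestor_parts = expand_lineage(ancestor, aliases).split(".")
    return (len(child_parts) > len(ancestor_parts)
            and child_parts[:len(ancestor_parts)] == ancestor_parts)
```

Because names follow tree topology, a prefix comparison on expanded names is enough to recover nesting; it says nothing about collection dates.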

Figure 3. Human respiratory syncytial virus A lineage classification. A) HRSV-A maximum-likelihood phylogenetic tree (1,480 sequences), colored by lineage classification. Black star indicates A.D lineage, defined by the 72-nt duplication in the G gene. Scale bar indicates substitutions per site. B) Simplified scheme of the lineage designation to highlight the presence of nested lineages. The amino acid changes in the F glycoprotein are listed next to lineage name and colored according to their location in the fusion protein.

Figure 5. Temporal distribution of HRSV-A and HRSV-B lineages. A total of 2,744 HRSV-A genomes and 2,443 HRSV-B genomes available in public databases through March 2023 were included. HRSV, human respiratory syncytial virus.
Contemporary lineages, such as B.D.4.1.1 and descendants B.D.E.1 and B.D.E.3, predominantly consisted of sequences from the United Kingdom. Global genomic surveillance bias presents a major confounding factor in lineage geodetection; for instance, most of the earliest lineages were detected in the United States, the principal contributor of HRSV genomes until 2007 (Appendix 1 Figures 2, 6).